Data visualization with ggplot2

Some of the contents of this document have been adapted from Wilke (2019). Take a look at the online version of the book for further insights.

Introduction

Data visualization is part art and part science. The challenge is to get the art right without getting the science wrong and vice versa. A data visualization first and foremost has to accurately convey the data. It must not mislead or distort. If one number is twice as large as another, but in the visualization they look to be about the same, then the visualization is wrong. At the same time, a data visualization should be aesthetically pleasing. Good visual presentations tend to enhance the message of the visualization. If a figure contains jarring colors, imbalanced visual elements, or other features that distract, then the viewer will find it harder to inspect the figure and interpret it correctly.

Scientists frequently (though not always!) know how to visualize data without being grossly misleading. However, they may not have a well-developed sense of visual aesthetics, and they may inadvertently make visual choices that detract from their desired message. Designers, on the other hand, may prepare visualizations that look beautiful but play fast and loose with the data.

The book attempts to cover the key principles, methods, and concepts required to visualize data for publications, reports, or presentations. Because data visualization is a vast field, and in its broadest definition could include topics as varied as schematic technical drawings, 3D animations, and user interfaces, I necessarily had to limit my scope for this book. I am specifically covering the case of static visualizations presented in print, online, or as slides. The book does not cover interactive visuals or movies, except in one brief section in the chapter on visualizing uncertainty. Therefore, throughout this book, I will use the words “visualization” and “figure” somewhat interchangeably. The book also does not provide any instruction on how to make figures with existing visualization software or programming libraries. The annotated bibliography at the end of the book includes pointers to appropriate texts covering these topics.

Ugly, bad, and wrong figures

Throughout this book, I frequently show different versions of the same figures, some as examples of how to make a good visualization and some as examples of how not to. To provide a simple visual guideline of which examples should be emulated and which should be avoided, I am clearly labeling problematic figures as “ugly”, “bad”, or “wrong”:

  • ugly—A figure that has aesthetic problems but otherwise is clear and informative.
  • bad—A figure that has problems related to perception; it may be unclear, confusing, overly complicated, or deceiving.
  • wrong—A figure that has problems related to mathematics; it is objectively incorrect.

Examples of ugly, bad, and wrong figures. (a) A bar plot showing three values (A = 3, B = 5, and C = 4). This is a reasonable visualization with no major flaws. (b) An ugly version of part (a). While the plot is technically correct, it is not aesthetically pleasing. The colors are too bright and not useful. The background grid is too prominent. The text is displayed using three different fonts in three different sizes. (c) A bad version of part (a). Each bar is shown with its own y-axis scale. Because the scales don’t align, this makes the figure misleading. One can easily get the impression that the three values are closer together than they actually are. (d) A wrong version of part (a). Without an explicit y axis scale, the numbers represented by the bars cannot be ascertained. The bars appear to be of lengths 1, 3, and 2, even though the values displayed are meant to be 3, 5, and 4.


Visualizing data: Mapping data onto aesthetics

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Whenever we visualize data, we take data values and convert them in a systematic and logical way into the visual elements that make up the final graphic. Even though there are many different types of data visualizations, and on first glance a scatter plot, a pie chart, and a heatmap don’t seem to have much in common, all these visualizations can be described with a common language that captures how data values are turned into blobs of ink on paper or colored pixels on screen. The key insight is the following: All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as aesthetics.

Aesthetics and types of data

Aesthetics describe every aspect of a given graphical element. A critical component of every graphical element is of course its position, which describes where the element is located. In standard 2d graphics, we describe positions by an x and y value, but other coordinate systems and one- or three-dimensional visualizations are possible. Next, all graphical elements have a shape, a size, and a color. Even if we are preparing a black-and-white drawing, graphical elements need to have a color to be visible, for example black if the background is white or white if the background is black. Finally, to the extent we are using lines to visualize data, these lines may have different widths or dash–dot patterns. Beyond this handful of examples, there are many other aesthetics we may encounter in a data visualization. For example, if we want to display text, we may have to specify font family, font face, and font size, and if graphical objects overlap, we may have to specify whether they are partially transparent.

Commonly used aesthetics in data visualization are position, shape, size, color, line width, and line type. Some of these aesthetics can represent both continuous and discrete data (position, size, line width, color) while others can usually only represent discrete data (shape, line type).


All aesthetics fall into one of two groups: Those that can represent continuous data and those that cannot. Continuous data values are values for which arbitrarily fine intermediates exist. For example, time duration is a continuous value. Between any two durations, say 50 seconds and 51 seconds, there are arbitrarily many intermediates, such as 50.5 seconds, 50.51 seconds, 50.50001 seconds, and so on. By contrast, number of persons in a room is a discrete value. A room can hold 5 persons or 6, but not 5.5. For the examples in Figure @ref(fig:common-aesthetics), position, size, color, and line width can represent continuous data, but shape and line type can usually only represent discrete data.
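As a rough illustration of this distinction in ggplot2, the sketch below maps a continuous variable to color and a discrete one to shape. It uses the built-in mtcars dataset as a stand-in, which is an assumption made for this example and not part of the data used in this document.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the document's own data here.
p_aes <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(
    aes(color = hp,             # continuous variable -> color gradient
        shape = factor(cyl)),   # discrete variable -> point shape
    size = 3
  )
p_aes
```

Mapping a continuous variable such as hp directly to shape would fail, because ggplot2 refuses to map continuous data onto the shape aesthetic.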

Next we’ll consider the types of data we may want to represent in our visualization. You may think of data as numbers, but numerical values are only two out of several types of data we may encounter. In addition to continuous and discrete numerical values, data can come in the form of discrete categories, in the form of dates or times, and as text (Table @ref(tab:basic-data-types)). When data is numerical we also call it quantitative and when it is categorical we call it qualitative. Variables holding qualitative data are factors, and the different categories are called levels. The levels of a factor are most commonly without order (as in the example of “dog”, “cat”, “fish” in Table @ref(tab:basic-data-types)), but factors can also be ordered, when there is an intrinsic order among the levels of the factor (as in the example of “good”, “fair”, “poor” in Table @ref(tab:basic-data-types)).
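In R, these qualitative variables can be sketched with factor(). The example below is only illustrative: the vectors are made up, reusing the category names mentioned above.

```r
# Unordered factor: categories without an intrinsic order
pets <- factor(c("dog", "cat", "fish", "cat"))
levels(pets)   # levels are sorted alphabetically by default

# Ordered factor: the levels carry an explicit order, from worst to best
quality <- factor(c("good", "poor", "fair", "good"),
                  levels = c("poor", "fair", "good"),
                  ordered = TRUE)
quality < "good"   # comparisons are meaningful for ordered factors
```

The order given in levels is the order ggplot2 uses when it places categories along an axis or in a legend.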

Visualizing amounts

The most common approach to visualizing amounts (i.e., numerical values shown for some set of categories) is using bars, either vertically or horizontally arranged. However, instead of using bars, we can also place dots at the location where the corresponding bar would end.

If there are two or more sets of categories for which we want to show amounts, we can group or stack the bars. We can also map the categories onto the x and y axis and show amounts by color, via a heatmap.
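The three options for two sets of categories can be sketched as follows. The data frame is hypothetical (made-up values chosen for this example), and the object names are arbitrary.

```r
library(ggplot2)

# Hypothetical amounts for two sets of categories (made-up values)
amounts <- data.frame(
  group = rep(c("A", "B", "C"), times = 2),
  year  = rep(c("2019", "2020"), each = 3),
  value = c(3, 5, 4, 2, 6, 5)
)

# Grouped bars: one bar per year, side by side within each group
p_grouped <- ggplot(amounts, aes(x = group, y = value, fill = year)) +
  geom_col(position = "dodge")

# Stacked bars: the same data, with bars placed on top of each other
p_stacked <- ggplot(amounts, aes(x = group, y = value, fill = year)) +
  geom_col(position = "stack")

# Heatmap: both category sets on the axes, the amount shown as color
p_heat <- ggplot(amounts, aes(x = group, y = year, fill = value)) +
  geom_tile()
```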

Distributions

Histograms and density plots provide the most intuitive visualizations of a distribution, but both require arbitrary parameter choices and can be misleading. Cumulative densities and quantile-quantile (q-q) plots (Chapter @ref(ecdf-qq)) always represent the data faithfully but can be more difficult to interpret.

Boxplots, violins, strip charts, and sina plots are useful when we want to visualize many distributions at once and/or if we are primarily interested in overall shifts among the distributions. Stacked histograms and overlapping densities allow a more in-depth comparison of a smaller number of distributions, though stacked histograms can be difficult to interpret and are best avoided. Ridgeline plots can be a useful alternative to violin plots and are often useful when visualizing very large numbers of distributions or changes in distributions over time.
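To make the point about parameter choices concrete, here is a small sketch using the built-in faithful dataset (an assumption for this example, not data from this document). The same variable is drawn with two bin widths, a density with a reduced bandwidth, and an empirical cumulative distribution that needs no tuning parameter.

```r
library(ggplot2)

# faithful ships with R (Old Faithful eruption durations, in minutes)
p_hist_narrow <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(binwidth = 0.1)   # many narrow bins: noisy but detailed

p_hist_wide <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(binwidth = 1)     # few wide bins: smooth but coarse

# Density plot: the bandwidth multiplier `adjust` plays a similar role
p_dens <- ggplot(faithful, aes(x = eruptions)) +
  geom_density(adjust = 0.5)

# Empirical cumulative distribution: no tuning parameter required
p_ecdf <- ggplot(faithful, aes(x = eruptions)) +
  stat_ecdf()
```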

Proportions

Proportions can be visualized as pie charts, side-by-side bars, or stacked bars, and as in the case for amounts, bars can be arranged either vertically or horizontally. Pie charts emphasize that the individual parts add up to a whole and highlight simple fractions. However, the individual pieces are more easily compared in side-by-side bars. Stacked bars look awkward for a single set of proportions, but can be useful when comparing multiple sets of proportions (see below).

When visualizing multiple sets of proportions or changes in proportions across conditions, pie charts tend to be space-inefficient and often obscure relationships. Grouped bars work well as long as the number of conditions compared is moderate, and stacked bars can work for large numbers of conditions. Stacked densities are appropriate when the proportions change along a continuous variable.
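One possible way to draw several sets of proportions as stacked bars is sketched below, assuming the built-in mtcars dataset as example data. With position = "fill", each bar is rescaled so that its segments add up to 1.

```r
library(ggplot2)

# Share of gear counts within each cylinder class (mtcars ships with R)
p_prop <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "fill") +   # rescale every bar to a total of 1
  labs(x = "Cylinders", y = "Proportion of cars", fill = "Gears")
p_prop
```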

xy relationships

Scatterplots represent the archetypical visualization when we want to show one quantitative variable relative to another. If we have three quantitative variables, we can map one onto the dot size, creating a variant of the scatterplot called a bubble chart. For paired data, where the variables along the x and the y axes are measured in the same units, it is generally helpful to add a line indicating x = y. Paired data can also be shown as a slope graph of paired points connected by straight lines.

For large numbers of points, regular scatterplots can become uninformative due to overplotting. In this case, contour lines, 2D bins, or hex bins may provide an alternative. When we want to visualize more than two quantities, on the other hand, we may choose to plot correlation coefficients in the form of a correlogram instead of the underlying raw data.
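The overplotting problem and the 2D-bin alternative can be sketched with the diamonds dataset that ships with ggplot2 (chosen here only because it is large; it is not part of this document's data).

```r
library(ggplot2)

# diamonds has about 54,000 rows, so a plain scatterplot overplots heavily
p_points <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# 2D bins: count the points in each rectangle and show the count as color
p_bins <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d(bins = 50) +
  scale_fill_viridis_c()
p_bins
```

Hex bins (geom_hex) work the same way but require the hexbin package to be installed.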

Uncertainty

Error bars are meant to indicate the range of likely values for some estimate or measurement. They extend horizontally and/or vertically from some reference point representing the estimate or measurement. Reference points can be shown in various ways, such as by dots or by bars. Graded error bars show multiple ranges at the same time, where each range corresponds to a different degree of confidence. They are in effect multiple error bars with different line thicknesses plotted on top of each other.
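A minimal sketch of error bars around dots, assuming the built-in mtcars dataset and taking the mean plus or minus one standard error as the displayed range (one common choice among several):

```r
library(dplyr)
library(ggplot2)

# Mean and standard error of mpg per cylinder class (mtcars ships with R)
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),
            se_mpg   = sd(mpg) / sqrt(n()))

p_err <- ggplot(mpg_summary, aes(x = factor(cyl), y = mean_mpg)) +
  geom_point(size = 3) +                        # reference point: the estimate
  geom_errorbar(aes(ymin = mean_mpg - se_mpg,   # range: mean +/- 1 SE
                    ymax = mean_mpg + se_mpg),
                width = 0.2) +
  labs(x = "Cylinders", y = "Mean miles per gallon")
p_err
```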

What’s ggplot?

ggplot2 is a package belonging to the tidyverse suite, providing a complete set of tools to visualize tidy data. It is a graphing library for R that makes beautiful graphs, though its syntax can be formidably complex, with a somewhat steep learning curve.

That being said, learning ggplot2 is worth the effort for several reasons. First, the graphs are beautiful. Second, ggplot2’s syntax, though seemingly arcane at times, forces you to think about the nature of your data, and the ideas that you are graphing. Lastly, a little bit of knowledge about ggplot2 can go a long way, and can build a powerful foundation for future learning.

Why use ggplot2?

  • Automatic legends, colors, etc.

  • Easy superposition, faceting, etc.

  • Nice rendering (although I don’t like the default grey theme).

  • Store any ggplot2 object for modification or future recall.

  • Lots of users (fewer bugs, plenty of help on Stack Overflow).

  • Lots of extensions.

  • Nice saving option.
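Two of these points, storing a plot object and saving it, can be sketched as follows. The dataset (mtcars), the object names, and the output file name are assumptions made for this example.

```r
library(ggplot2)

# Store a plot in an object...
p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# ...and modify it later without rewriting the original call
p_mod <- p_base +
  geom_smooth(method = "lm") +
  theme_minimal()

# Save to a temporary file; size and resolution are set explicitly
out_file <- file.path(tempdir(), "mtcars_scatter.png")
ggsave(out_file, plot = p_mod, width = 12, height = 8, units = "cm", dpi = 150)
```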

Learning ggplot

There are 3 essential elements to any ggplot call:

  1. An aesthetic that tells ggplot which variables are being mapped to the x axis, the y axis, and often other attributes of the graph, such as the fill color. Intuitively, the aesthetic can be thought of as what you are graphing.
  2. A geom or geometry that tells ggplot about the basic structure of the graph. Intuitively, the geom can be thought of as how you are graphing.
  3. Other options, such as a graph title, axis labels and overall theme for the graph.

The basic syntax and elements of a ggplot call are something close to this:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
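As a hedged illustration, the template above could be filled in like this. The dataset (mtcars) and every concrete choice of geom, stat, position, coordinate system, and facet are assumptions picked only to show where each piece goes.

```r
library(ggplot2)

p_template <- ggplot(data = mtcars) +   # <DATA>
  geom_bar(                             # <GEOM_FUNCTION>
    mapping = aes(x = factor(cyl)),     # <MAPPINGS>
    stat = "count",                     # <STAT>
    position = "stack"                  # <POSITION>
  ) +
  coord_flip() +                        # <COORDINATE_FUNCTION>
  facet_wrap(~ am)                      # <FACET_FUNCTION>
p_template
```

In practice most of these arguments have sensible defaults, so a typical call only spells out the data, the mappings, and the geom.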

Take a look at the reference guide on the tidyverse site, visit the STHDA website, or check the ggplot2 cheatsheet.

Pipelines and ggplot2

Since ggplot2 is a member of the tidyverse, it was designed to work within dplyr pipelines. That means we can use any number of verbs to manipulate a data frame and then pass it to ggplot() using %>%. After that, we keep using the regular + operator to add plot instructions:

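library(tidyverse)  # loads ggplot2, dplyr and the %>% pipe

# Assumes the fires data frame is already loaded in the session
fires %>%
  ggplot(aes(x=log(BAREA))) +
    geom_histogram()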
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
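The same pattern with a couple of dplyr verbs before the plot might look like the sketch below. It uses the built-in mtcars dataset instead of fires (an assumption, so that the example runs on its own), and the derived column wt_kg is a hypothetical name.

```r
library(dplyr)
library(ggplot2)

p_pipe <- mtcars %>%
  filter(hp > 100) %>%                 # dplyr verbs are chained with %>%
  mutate(wt_kg = wt * 453.592) %>%     # wt is recorded in units of 1000 lbs
  ggplot(aes(x = wt_kg, y = mpg)) +    # from ggplot() onwards, layers use +
    geom_point()
p_pipe
```

Mixing up the two operators is a common slip: %>% passes data between functions, while + adds layers to a plot.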

Basic plots with ggplot2

Before moving on to fancier plots, let’s cover the very basics of plotting and plot types:

  • Histogram and density: geom_histogram and geom_density. The aesthetic takes just an x variable.
  • Distributions: geom_boxplot. The aesthetic takes just a y variable.
  • Line plots: geom_line. They usually take both x and y.
  • Bar plots: geom_bar. Similar to histograms, but we choose a stat argument:
      • identity: take the raw value.
      • count: frequency counts (the default).
  • Scatter plots: geom_point. Used to show the relationship between two variables. They require both x and y.
fires %>%
  ggplot(aes(x=log(BAREA))) +
    geom_density()

trees %>%
  ggplot(aes(x=DiamIf3,y=HeiIf3)) +
    geom_point()
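The difference between the two stat options for bar plots can be sketched as follows, again assuming the built-in mtcars dataset as example data. Both plots show the same bar heights, reached by two routes.

```r
library(ggplot2)

# stat = "count": ggplot2 tallies the rows in each category for us
p_count <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(stat = "count")

# stat = "identity": bar heights are taken from a y variable as-is,
# so the counts have to be computed beforehand
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))
p_identity <- ggplot(cyl_counts, aes(x = cyl, y = Freq)) +
  geom_bar(stat = "identity")
```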

Default themes

ggplot2 offers several preset themes to quickly fix the ugliness of the default settings:

  • theme_minimal
  • theme_bw
  • theme_light
  • theme_grey
  • theme_dark
fires %>%
  ggplot(aes(x=log(BAREA))) +
    geom_density()+
    theme_light()

Exercise 1

Try to apply the predefined themes in ggplot.

Controlling the axis

We can adapt any element of a plot. Very often we will want to modify the limits of the x and y axes, the labels, or the tick marks.

http://www.sthda.com/english/wiki/ggplot2-axis-scales-and-transformations
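A short sketch of these adjustments, assuming the built-in mtcars dataset and arbitrary limits and breaks chosen for illustration:

```r
library(ggplot2)

p_axis <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_x_continuous(limits = c(1, 6),             # axis limits
                     breaks = seq(1, 6, by = 1)) + # tick positions
  scale_y_continuous(limits = c(10, 35)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")  # axis labels
p_axis
```

Limits set through a scale drop any observations outside the range, and ggplot2 warns about the removed rows. To zoom in without discarding data, coord_cartesian(xlim = ..., ylim = ...) is the usual alternative.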

Setting up our own theme

newtheme <- theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.background = element_rect(color='white'),
        axis.text.x = element_text(face = "bold", 
                                   size = 7, 
                                   vjust = 0.5,
                                   angle = 0))

Exercise 2

Take a look at the following plots. Try to reproduce them by running the code. Identify any potential issues and correct them.

Plotting time series

fires %>%
  filter(BAREA>0,YEAR>1974) %>%
  group_by(YEAR) %>%
  summarise(n=n(),BA=sum(BAREA))%>%
  ggplot(aes(x=YEAR,y=n)) +
    geom_line() + 
    geom_smooth() +
    newtheme +
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

fires$CAUSA.T <- fires$CAUSA
fires$CAUSA.T <-recode(fires$CAUSA.T, '1'= 'Natural','2'='Accident','3'='Accident','4'= 'Arson','5'='Unknown','6'='Restarted')
fires$CAUSA.T = factor(fires$CAUSA.T, levels=c('Natural','Accident',
                                               'Arson','Unknown','Restarted'))

fires %>%
  filter(YEAR>1974) %>%
  group_by(YEAR,MONTH,CAUSA.T) %>%
  summarise(n=n()) %>%
  ggplot(aes(y=YEAR,x=factor(MONTH),fill=log(n))) +
  geom_tile(color="white") +
  scale_x_discrete(name='MONTH',limits = c(1:12), expand = c(0, 0)) +
  scale_y_continuous(limits = c(1974,2015), expand = c(0, 0)) +
  scale_fill_viridis_c(option = 'B') +
  facet_wrap(~CAUSA.T) +
  newtheme

Barplots

ba.mean <- as.numeric(fires %>%
  group_by(YEAR) %>%
  summarise(BA=sum(BAREA))%>%
  summarise(mean(BA)))

fires %>%
  filter(YEAR>1974) %>%
  mutate(LARGE = ifelse(BAREA>500,"Large Fire","Fire")) %>%
  group_by(YEAR,LARGE) %>%
  summarise(n=n(),BA=sum(BAREA)) %>%
  ggplot(aes(x=YEAR,y=BA,group=LARGE,fill=LARGE)) +
    geom_bar(stat = 'identity') +
    scale_fill_manual(name='',
                      values = c('Fire'= '#EFC000FF','Large Fire'='#0073C2FF')) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_x_continuous(limits = c(1975,2015), expand = c(0, 0)) +
    labs(x='',y='% area of large fires') +
    geom_hline(yintercept = ba.mean,col = "red",lty=2) +
    newtheme +
    theme(legend.position='bottom',
          axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank())
## Warning: Removed 4 rows containing missing values (geom_bar).

fires %>%
  filter(YEAR>1974) %>%
  mutate(LARGE = ifelse(BAREA>500,"Large Fire","Fire")) %>%
  group_by(YEAR,LARGE) %>%
  summarise(n=n(),BA=sum(BAREA)) %>%
  mutate(fracc = BA / sum(BA)) %>%
  ggplot(aes(x=YEAR,y=fracc,group=LARGE,fill=LARGE)) +
    geom_bar(stat = 'identity') +
    scale_fill_manual(name='',
                      values = c('Fire'= '#EFC000FF','Large Fire'='#0073C2FF')) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_x_continuous(limits = c(1975,2015), expand = c(0, 0)) +
    labs(x='',y='% area of large fires') +
    newtheme +
    theme(legend.position = 'bottom')
## Warning: Removed 4 rows containing missing values (geom_bar).

fires %>%
  filter(YEAR>1974) %>%
  mutate(LARGE = ifelse(BAREA>500,"Large Fire","Fire")) %>%
  group_by(YEAR,LARGE) %>%
  summarise(n=n(),BA=sum(BAREA)) %>%
  ggplot(aes(x=YEAR,y=BA,group=LARGE,fill=LARGE)) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(name='',
                    values = c('Fire'= '#EFC000FF','Large Fire'='#0073C2FF')) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(limits = c(1975,2015), expand = c(0, 0)) +
  labs(x='',y='Area of large fires (ha)') +
  geom_hline(yintercept = ba.mean,col = "red",lty=2) +
  facet_wrap(~LARGE) +
  newtheme +
  theme(legend.position='none',
        axis.line.x = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())
## Warning: Removed 4 rows containing missing values (geom_bar).

Donut and pie plots

mycols <- c("#0073C2FF", "#EFC000FF", "#868686FF", "#CD534CFF")

fires %>%
  filter(CAUSA!=6)%>%
  group_by(CAUSA.T) %>%
  summarise(n=n(),BA=sum(BAREA)) %>%
  mutate(f=round(n/sum(n)*100,1)) %>%
  arrange(desc(f)) %>%
  mutate(lab.ypos = cumsum(f) - 0.5*f)%>%
  ggplot(aes(x=2 ,y=f, fill=CAUSA.T))+
    geom_bar(width = 1, stat = "identity", color = "white") +
    coord_polar("y", start = 0)+
    geom_text(aes(y = lab.ypos, label = paste(f,'%')), color = "white")+
    scale_fill_manual(values = mycols) +
    theme_void()+
  xlim(0.5, 2.5)

Scatter plots

trees %>%
  ggplot(aes(x=DiamIf3,y=HeiIf3)) +
  geom_hex() +
  scale_fill_viridis_c()

trees %>%
  ggplot(aes(x=DiamIf3,y=HeiIf3)) +
  stat_bin2d() +
  scale_fill_viridis_c()

trees %>%
  ggplot(aes(x=DiamIf3,y=HeiIf3)) +
  geom_hex() +
  scale_fill_viridis_c()+
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Mapping spatial information

We can read vector GIS layers in several ways. The sf package offers a collection of methods fully interoperable with the tidyverse approach:

Reading vector layers:

library(sf)  # provides read_sf() and the geom_sf() data classes
fire.shape <- read_sf('data/cuad_nfires_var_pen.shp')

And then we can manipulate them with dplyr and plot them with ggplot2:

fire.shape %>%
  dplyr::select(X1988:X2012) %>%
  mutate(NFires = rowSums(.[,1:25,drop=TRUE], na.rm = TRUE)) %>%
    ggplot(aes(fill=NFires)) +
      geom_sf(color=NA) +
      scale_fill_viridis_c(option='B') +
      theme_light()
## Warning in st_is_longlat(x): bounding box has potentially an invalid value
## range for longlat data
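For readers without access to the shapefile used above, the same workflow can be tried on the demo layer bundled with the sf package. This is only a sketch: the North Carolina counties file and its BIR74 column replace the fire data, and the object names are arbitrary.

```r
library(sf)
library(ggplot2)

# nc.shp is a demo shapefile shipped with sf (North Carolina counties)
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))

p_map <- ggplot(nc) +
  geom_sf(aes(fill = BIR74), color = NA) +  # BIR74: births per county, 1974-78
  scale_fill_viridis_c(option = "B") +
  theme_light()
p_map
```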

We can also work with raster layers. We use the raster and rgdal packages to read raster layers, and then convert them into a data frame so that we can plot them using geom_raster:

library(raster)
r <- raster('data/fire_occurrence.tif')
r.df <- raster::as.data.frame(r, xy=TRUE)
ggplot(data=r.df, aes(x=x, y=y, fill=fire_occurrence)) +
  geom_raster() +
  scale_fill_viridis_c() +
  theme_light()

Wilke, C.O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media, Incorporated. https://books.google.es/books?id=L3ajtgEACAAJ.

Marcos Rodrigues

2020/02/03